Reporting quality in preclinical animal experimental research in 2009 and 2018: A nationwide systematic investigation

Lack of translation and irreproducibility challenge preclinical animal research. Insufficient reporting of methodologies that safeguard study quality is part of the reason. This nationwide study investigates the reporting prevalence of these methodologies and scrutinizes the reported information’s level of detail. Publications were drawn from two time periods to convey any reporting progress and had at least one author affiliated with a Danish university. We retrieved all relevant animal experimental studies using a predefined research protocol and a systematic search. A random sample of 250 studies from each of the years 2009 and 2018 led to 500 publications in total. Reporting of measures known to impact estimates of study results was assessed. Part I discloses a simplified two-level scoring “yes/no” to identify the presence of reporting. Part II demonstrates an additional three-level scoring to analyze the reported information’s level of detail. Overall reporting prevalence is low, although minor improvements are noted. Reporting of randomization increased from 24.0% in 2009 to 40.8% in 2018, blinded experiment conduct from 2.4% to 4.4%, blinded outcome assessment from 23.6% to 38.0%, and sample size calculation from 3.2% to 14.0%. Poor reporting of details is striking, with reporting of the random allocation method to groups being only 1.2% in 2009 and 6.0% in 2018. Reporting of the sample size calculation method was 2.4% in 2009 and 7.6% in 2018. Only the reporting of conflict-of-interest statements increased considerably, from 37.6% in 2009 to 90.4% in 2018. Measures safeguarding study quality are poorly reported in publications affiliated with Danish research institutions. Only a modest improvement was noted during the period 2009–2018, and the lack of details urgently prompts institutional strategies to accelerate this. We suggest thorough teaching in designing, conducting and reporting animal studies. Education in systematic review methodology should be implemented in this training and will increase motivation and behavior working towards quality improvements in science.


Introduction
Poor reproducibility and translational failure in biomedical research lead to skepticism regarding the reliability of preclinical research findings. The reasons are multi-factorial [1][2][3][4]. A prevalent issue is unsatisfactory internal validity. Internal validity is the extent to which the design and conduct of a study eliminate the possibility of systematic errors (bias) [4]. Appropriate methodologies safeguarding against systematic errors can be implemented in the design, conduct, and analysis of an experiment in order to increase the internal validity [4].
Essential safeguards are blinding, randomization, and a thorough description of animals' and samples' flow including reasons for exclusion [5]. The judgment of the scientific evidence is hampered if these measures are poorly reported [6]. Evidence exists that lack of reporting corresponds to the absence of conduct [7,8]. Systematic reviews of preclinical animal studies disclose smaller effect sizes when randomization and blinding are implemented compared with studies not reporting these precautions [9][10][11][12]. This finding is corroborated in meta-epidemiological studies of clinical data that identify a negative additive impact when more than one safeguard is omitted [13][14][15][16]. Attrition bias (i.e., poor handling of dropouts) skews data and jeopardizes a study's scientific robustness. Holman et al. demonstrate that losses of only a few animals in a study can distort study effects [17]. A safeguard of importance is how the sample size is reached. Preclinical animal experiments often carry (too) small group sizes. A drawback to this is that positive findings may be due to chance rather than actual effect [18][19][20]. Thus, comprehensive sample size calculations based on the best available evidence are paramount. Other influential quality factors are, for example, the animals' health status or comorbidities before and during experiments, as undetected diseases may affect the study outcome [21][22][23].
The Animal Research: Reporting of In Vivo Experiments (ARRIVE) guidelines for animal experiments reporting, first published in 2010, provide recommendations on improving low reporting standards and have recently been updated [24,25]. However, studies repeatedly show inadequate reporting of quality indicators [26][27][28][29][30], suggesting the unsuccessful implementation of the guidelines, even though over 1000 journals endorse them. The implementation may be hindered by the lack of engagement of multiple stakeholders who all must engage in improving the reporting quality. In this context, the use of the ARRIVE guideline by researchers is necessary already at the planning stage to help improve experimental design and, in turn, improve reporting. Previous research has investigated the prevalence of reporting of measures to reduce the risk of bias for specific animal disease models or subjects of interest [28][29][30][31]. Other previous evaluations of preclinical reporting have provided an overview of the reporting status of items related to the internal validity or rigor of these experiments (e.g. blinding and randomization) [32]. This study investigates the reported information's level of detail by assessing preclinical studies within all animal experimental research fields with one or more authors affiliated with Danish research institutions. In part I of the study, we focus on the overall reporting status of methodological safeguards. In part II, the focus is on the level of detail given for each reported item. To detect whether progress over the years exists, we investigated publications containing experiments published before (the year 2009) and after (the year 2018) publication of the ARRIVE guidelines [24].

Materials and methods
The experimental design was based on random sampling to avoid bias. An equal number of studies from each year were included to compare the results between the two time periods. It was estimated that a thorough assessment of 500 papers in total (250 papers from each year) could be performed within the given timeframe.

Study protocol
To further prevent methodological flaws and minimize bias, we modified a pre-specified systematic review protocol for animal intervention studies offered by the SYstematic Review Centre for Laboratory animal Experimentation (SYRCLE) [33]. The protocol was uploaded at SYRCLE (7th of November 2018) for guidance and feedback and is found in the supporting information (S1 File).

Selection of studies
In collaboration with a library information specialist, we retrieved all potentially relevant studies using a modified, comprehensive search strategy [34,35]. The search was systematically performed in two databases, Medline (via PubMed) and Embase. All in vivo studies conducted in non-human vertebrates with one or more authors affiliated with at least one of five Danish universities of interest were retrieved. The search was divided into two separate searches based on publication year (search 1: 1st of January 2018 until 6th of November 2018; search 2: the year 2009). The studies were imported to two dedicated EndNote libraries (EndNote X8, Clarivate Analytics, Philadelphia, USA), and duplicates were removed. One thousand, one hundred and sixty-one studies from 2009 and 1890 studies from 2018 were found.
The information from the EndNote libraries was copied to Excel (MS Office, version 2016, Microsoft Corp., USA), and the publications from each year were randomized using the "=RAND()" command, thereby allocating a unique random number to each publication. Because the search strategy was deliberately comprehensive in order to identify all relevant preclinical animal studies, the majority of the retrieved studies were not applicable. To meet the goal of including 250 relevant publications from each year, publications were imported consecutively in the randomized order (first 500 studies each from the years 2009 and 2018, then 250 studies and lastly 150 studies from each year) into a systematic review manager software program, Covidence (Covidence, Melbourne, Australia) [36]. A total of 1800 studies (out of 3051) were screened for eligibility. Two hundred and fifty-six from 2009 and 275 from 2018 were found eligible. Of these, 250 publications from each year were selected based on the random sampling allocation sequence. Studies were excluded based on the following criteria: science related to farming, wild animals or invertebrates, environment, human (clinical) studies, in vitro research, not primary papers/publications, lack of abstract or full text, studies containing no intervention or no Danish author affiliation, and exploratory studies (the latter were identified through study author statements that the study was explorative, or were assessed to investigate novel questions and to be hypothesis-generating). Further information on the distribution of excluded studies is found in the supporting information (S2 File). The flow diagram for random sampling, screening, and selection of publications is shown in Fig 1.
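The randomization and consecutive batch screening described above can be illustrated with a minimal sketch. It is not the procedure's actual implementation (the study used Excel and Covidence); the function names, the seed, and the eligibility rule are hypothetical stand-ins, and manual screening is replaced by a placeholder function.

```python
import random


def randomize_order(publication_ids, seed):
    """Give every record a random number and sort by it (analogous to a RAND() column)."""
    rng = random.Random(seed)  # a fixed seed makes the sequence reproducible
    keyed = [(rng.random(), pub_id) for pub_id in publication_ids]
    keyed.sort()
    return [pub_id for _, pub_id in keyed]


def screen_in_batches(ordered_ids, is_eligible, target, batch_sizes):
    """Screen consecutive batches in randomized order until `target` eligible records exist."""
    eligible = []
    screened = 0
    for size in batch_sizes:
        batch = ordered_ids[screened:screened + size]
        screened += len(batch)
        eligible.extend(pub_id for pub_id in batch if is_eligible(pub_id))
        if len(eligible) >= target:
            break
    # Keep the first `target` eligible records, preserving the random order.
    return eligible[:target], screened


if __name__ == "__main__":
    # 1161 records were retrieved for 2009; the identifiers here are invented.
    ids_2009 = [f"2009-{i:04d}" for i in range(1, 1162)]
    order = randomize_order(ids_2009, seed=2009)

    # Hypothetical rule standing in for the manual eligibility screening.
    def demo_rule(pub_id):
        return int(pub_id[-4:]) % 3 == 0

    selected, n_screened = screen_in_batches(
        order, demo_rule, target=250, batch_sizes=[500, 250, 150]
    )
    print(len(selected), n_screened)
```

Recording the seed alongside the sorted list would allow the allocation sequence to be regenerated, which a volatile spreadsheet random-number column does not by itself guarantee.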

Data extraction and analysis
The Covidence Risk of Bias (RoB) tool was selectively modified for this study's aims and to assess reporting quality in compliance with SYRCLE's RoB tool for animal studies [37]. Each publication was assessed according to 10 items primarily based on the Landis four related to the quality of reporting of significant methodology and included in the ARRIVE guidelines [5,24,25]. The selection of items was due to the nature of the study capturing different types of animal research. One item, "health status", was chosen since it, to our knowledge, is scarcely investigated even though it may influence many research outcomes [21]. The reporting quality form and algorithms for scoring are included in the supporting information (S1 Table). Two independent reviewers (KFP and JCS) assessed publications for reporting quality, each blinded to the other's assessment. Reviewers examined the full text of the articles, including figures, tables, and supplemental information, but references to other studies were not evaluated. Our approach for assessing the reporting quality included three steps:
Step 1: To investigate the overall reporting status (Part I) of the selected items, each item was operationalized such that we scored a result of "Yes" or "No" in Covidence. Publications were qualitatively scored "Yes" if the specific item was reported or "No" when there was no reporting of the item or when criteria for "Yes" were not met. "Unclear" or "Partial" scores were not used. In instances where the item was only partially reported and did not contain the complete information defined in the item, or where items were reported as not conducted (e.g., authors reported that randomization was not conducted), the study was scored as "Yes" and notes were provided. Details of this process are given in the supporting information (S1 Table). Each item's annotations and quotes were selected and saved for subsequent data quantification. This extra step made judgment decisions during this review consistent.
Step 2: After completing each reviewer's initial reporting quality assessments, a consensus of the reporting quality results was undertaken in Covidence. If both reviewers agreed on the item, the final judgment defaulted to the agreed value leaving discrepant items for further assessment. Discrepancies were resolved, and the consensus was reached through discussion and the inclusion of a third reviewer (BSK).
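The consensus logic of Step 2 can be sketched as follows. This is only an illustration of the rule "agreed items default to the agreed value, discrepant items go to discussion"; the function name and the item labels are hypothetical, and the actual consensus was performed in Covidence.

```python
def reconcile(reviewer_a, reviewer_b):
    """Split item scores into agreed values and items that need discussion.

    Both arguments map an item name to that reviewer's "Yes"/"No" score.
    """
    if reviewer_a.keys() != reviewer_b.keys():
        raise ValueError("Both reviewers must score the same items")
    agreed = {}
    discrepant = []
    for item, score_a in reviewer_a.items():
        if score_a == reviewer_b[item]:
            agreed[item] = score_a  # final judgment defaults to the agreed value
        else:
            discrepant.append(item)  # resolved later by discussion / third reviewer
    return agreed, discrepant


if __name__ == "__main__":
    # Hypothetical scores for one publication.
    first = {"randomization": "Yes", "blinded outcome assessment": "No", "health status": "No"}
    second = {"randomization": "Yes", "blinded outcome assessment": "Yes", "health status": "No"}
    agreed, to_discuss = reconcile(first, second)
    print(agreed)
    print(to_discuss)
```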
Step 3: After completing the assessment in Covidence, data were extracted and sorted in MS Excel. After that, a numerical score of 1, 2, 3, or 0 (where 0 corresponds to no information) was given according to the quality of information (quotes and comments) saved for each item described in step 1 (Part II). Details of this process are provided in the supporting information (S1 Table), and our criteria for aiding judgment and associated examples of quotes are found in the supporting information (S2 Table).
Survey data were analyzed using MS Excel and Stata Statistical Software: Release 16.1 (Stata Corp. 2019. College Station, TX: StataCorp LLC). Descriptive statistics were generated for all items and were presented in bar graphs and tables. Prevalence and differences between prevalence for the 2009 and 2018 studies were reported with 95% confidence intervals. Reviewer agreement and Cohen's Kappa values are disclosed in the supporting information (S3 File).
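For readers who wish to reproduce this type of summary, the sketch below shows one common way to compute a prevalence, a difference between two prevalences with a 95% confidence interval, and Cohen's kappa. It is an assumption-laden illustration: the text does not state which interval method was used in Stata, so a simple Wald (normal-approximation) interval is assumed here, and all function names are hypothetical. The counts 60/250 and 102/250 are derived from the reported randomization percentages (24.0% and 40.8% of 250).

```python
from math import sqrt
from statistics import NormalDist


def prevalence_ci(count, n, level=0.95):
    """Prevalence with a Wald confidence interval (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = count / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def difference_ci(count_1, n_1, count_2, n_2, level=0.95):
    """Difference p2 - p1 between two independent prevalences with a Wald interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p1, p2 = count_1 / n_1, count_2 / n_2
    se = sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    diff = p2 - p1
    return diff, diff - z * se, diff + z * se


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical category throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Randomization reporting: 60 of 250 (2009) versus 102 of 250 (2018).
    print(prevalence_ci(60, 250))
    print(prevalence_ci(102, 250))
    print(difference_ci(60, 250, 102, 250))
```

Other interval methods (for example Wilson or exact intervals) behave better for prevalences close to 0%, such as several of the items reported here, so the Wald choice should be read as a simplification.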

Results
The flow of the publications retrieved is described in the Methods section and in Fig 1. Five hundred publications were included in the investigation, 250 from 2009 and 250 from 2018, according to the procedure described in the Methods section. A simplified two-level scoring (reported "yes" or not reported "no") is given in part I. Part II discloses the results of a three-level scoring system (1, 2, and 3), where a score of 1 comprised the least detailed information conveyed. The results of part I are presented graphically in Fig 2 and the results of part II are shown in Table 1.

Part II: Level of detail of reported items
Further analysis of the reported information, presented in Table 1, revealed that of the publications reporting a sample size calculation, six (2.4%) from 2009 provided information regarding how the sample size was chosen and described the method employed. In 2018, this number was 19 (7.6%). The remaining publications that reported on sample size calculation either did not include a calculation method (1 (0.4%) from 2009 and 13 (5.2%) from 2018) or stated that a sample size calculation was not performed (1 (0.4%) from 2009 and 3 (1.2%) from 2018).
Of the publications from 2009 reporting on blinded conduct of the experiment, one publication (0.4%) reported that blinding was not conducted. This increased to seven publications (2.8%) in 2018.
Of the publications reporting on blinded outcome assessment, one publication (0.4%) from 2009 reported that blinding was not conducted. This increased to nine publications (3.6%) in 2018.
Further analysis of information regarding attrition revealed that only 81 (32.4%) of the publications reporting on numbers of samples or animals in the result section (attrition I) from 2009 reported exact numbers for all analyses. This number decreased to 74 (29.6%) in 2018. The remaining publications (114 (45.6%) from 2009 and 132 (52.8%) from 2018) either did not report exact numbers or did not report numbers for all analyses in the study.

[27]. We also found only modest improvements over time in reporting randomization, blinding, and sample size calculation, whereas the reporting of conflict of interest increased considerably. Our investigation highlights that while this topic has been extensively addressed both in the scientific community and through the development of reporting guidelines [24-31, 38, 39], reporting remains insufficient. There is still considerable room for improvement to strengthen the validity of most published preclinical animal studies in the light of the assumption that lack of reporting corresponds to limited conduct.
We further examined the level of detail in the information disclosed and found it to be very limited. Randomization and blinding are essential methodological techniques to help reduce the influence of bias on the study outcome. Despite their importance, transparency in reporting these items was insufficient. In studies where blinding and randomization were not feasible, the reason (e.g., study design) was rarely justified, considered a limitation, or acknowledged in the study report. A description of why such a precaution is not taken will bring the reader's attention to the missing safeguard so the results can be judged accordingly. Many studies additionally have very complex study designs, and the precautions taken to limit bias should be sufficiently reported.
In general, essential details related to randomization, such as the allocation method and sequence, were rarely conveyed. This fact is corroborated in studies of specific animal models of acute lung injury by Avey et al. They operationalized the ARRIVE guidelines to determine completeness in reporting and found no random sequence generation reporting [30]. Ting et al. similarly disclosed that no studies revealed the allocation method in experimental animal studies of rheumatology [28]. Our investigation concludes that there seems to be a general challenge across study fields.
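To make concrete what a reportable allocation method and sequence could look like, the sketch below implements permuted-block randomization with a recorded seed. It is an illustrative example only, not a method taken from any of the assessed publications; the function name, animal identifiers, group labels, and seed are hypothetical.

```python
import random


def block_randomize(animal_ids, groups, seed):
    """Allocate animals to groups in shuffled blocks so group sizes stay balanced.

    Each block contains every group exactly once, in random order.
    """
    rng = random.Random(seed)  # reporting the seed makes the sequence reproducible
    allocation = {}
    block_size = len(groups)
    for start in range(0, len(animal_ids), block_size):
        block = list(groups)
        rng.shuffle(block)
        for animal, group in zip(animal_ids[start:start + block_size], block):
            allocation[animal] = group
    return allocation


if __name__ == "__main__":
    animals = [f"mouse-{i:02d}" for i in range(1, 13)]
    sequence = block_randomize(animals, ["control", "low dose", "high dose"], seed=42)
    for animal, group in sequence.items():
        print(animal, group)
```

A methods section could then state the method (permuted blocks of three), the software, and the seed, which together are sufficient for a reader to regenerate the allocation sequence.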
Interestingly, small sample sizes may negatively influence successful randomization as groups may be unbalanced on critical prognostic variables. Underpowered experiments will give less precise estimates of treatment effects. This risk can be accounted for by using appropriate methods for sample size calculation.
Only a few publications provided sufficient information on whether and how the exact sample sizes were calculated. In some publications, historical precedent rather than reliable statistics formed the basis for the reported number of animals per group. We are puzzled that such unjustified scientific information passes through a review process. In a study by Gulin et al., investigating compliance with ARRIVE guidelines in studies of experimental animal models for Chagas disease, there was no reporting of sample size calculation. The authors of that investigation speculated that "animal numbers were more a matter of habit than a statistical decision" [29]. This speculation highlights that results may sometimes be due to chance rather than an actual effect. A more recent study of the veterinary literature that focused on reporting adherence to the ARRIVE guidelines found sample size calculations to be missing in both ARRIVE guideline supporting and non-supporting journals, indicating that a journal's support for the ARRIVE guidelines has not to date resulted in improved reporting of the guideline items and other essential indicators of study design quality [31]. If the planned sample size is not derived statistically, this should be explicitly stated along with the rationale for the intended sample size (e.g., exploratory nature). We found that information on whether a study was confirmatory or exploratory was sparse. This lack of information poses an additional problem for how much weight can be ascribed to the published results.
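As an illustration of what a statistically derived and reportable sample size can look like, the sketch below uses the standard normal-approximation formula for comparing two group means, n = 2((z_{1-α/2} + z_{1-β}) / d)² animals per group, where d is the standardized effect size. This is a generic textbook calculation, not a method used by the authors or by the assessed publications; the function name and default values are assumptions, and the approximation slightly underestimates the number obtained from a t-distribution-based calculation for small groups.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided, two-group comparison of means.

    effect_size is the standardized difference (difference in means / common SD).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.5, 1.0, 1.5):
        print(d, animals_per_group(d))
```

Reporting the assumed effect size, its source, the significance level, and the power alongside the resulting number gives readers what they need to judge the calculation.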
Systematic differences between animals completing a study and the excluded animals can introduce bias to the study results, a bias known as attrition bias [17]. It is important to report the exact numbers of animals at the beginning and the end of the study, how many animals were excluded during the study, and for which reasons. Nevertheless, most publications failed to report exact numbers and reasons for exclusion, and reporting of animal numbers even decreased in 2018 compared with 2009. Several studies reporting on the number of samples or animals used demonstrated inconsistencies in reporting between the methods section and the results section. Only one publication from 2018 included this information objectively in a flow chart, compared to two publications in 2009. A flow chart illustrating each animal's fate and the derived samples or measurements would be effective in providing the reader with a thorough overview.
An uncommonly reported and, to our knowledge, rarely investigated item is the animals' general health status. In our study, this was one of the most poorly reported items, and only one publication from 2018 included a health report with details of the specific agents for which the animals were screened. This finding is disturbing since infections and/or comorbidities influence disease outcomes in both preclinical animal research and treatment and pathology in patients [40]. Documenting these details is essential in understanding the discrepancies seen in laboratory results [25,41]. In our experience, many researchers do not take this fact into account. A fully disclosed health report should be mandatory and based on a case-oriented approach to the FELASA (Federation of European Laboratory Animal Science Associations) guidelines [22,23]. Moreover, the impact of animal health on study outcomes is complex and warrants further investigation.
We envisaged that an improvement in methodological reporting would be noticeable since many journals have endorsed the ARRIVE guidelines. However, advancements progress at a slow pace or do not happen at all. We show that the reported information's level of detail is generally incomplete. The incomplete reporting of these details directly impedes the ability to assess the validity of the experiments. When research cannot be assessed on its methodological rigor, it becomes less valuable and thus is a waste of essential resources and animal lives. The translation of research findings into therapeutic applications becomes highly unreliable, and there is a high risk of guiding research in the wrong direction. Stakeholders such as funders and publishers may incentivize study quality, but perhaps the essential science stakeholders are researchers themselves. Researchers must conduct responsible research of high quality and have the ability to do so. This may also call for new evaluation methods [42].
Nevertheless, to conduct high-quality research, researchers need to be allocated time, understand the importance of research integrity, be trained in best practices, and know about the available tools, such as guidelines for planning and conducting animal-based studies [25,43]. Recently, a case study demonstrated the impact of conducting preclinical systematic reviews on the quality and transparency of research and on researchers' awareness and motivation to promote change within their fields [44]. A critical comment was that many had not previously known how to report their research adequately, nor had they realized the importance of accurate reporting. Through systematic reviews, they became aware of the low reporting quality, and they became more complete and more precise in the way they planned, executed, and reported their studies. They also changed their view on the necessity to improve their team and research field. Hence, the assumption that this topic is well known and recognized among researchers may be wrong. There seems to be a need for more thorough education within this field of science to implement rigor in preclinical animal studies.
To accelerate progress, we conclude that educational institutions must look closer to home and support and increase relevant teaching and training in designing and reporting animal studies. It is disappointing that large prestigious public research institutions fail to adequately report study characteristics (and also that the institutions assess publications with poor study quality at the same level as publications with high study quality). Proper education is necessary, and knowledge from education in systematic review methodology and the conduct of randomized controlled clinical studies may guide how to approach the topic. Initiatives such as collaborative research groups and networks that serve as a backbone for this strategy should be prioritized.